A Compression Algorithm for Nucleotide Data Based on Differential Direct Coding and Variable Length Look up Table (LUT)

Authors

  • Govind Prasad Arya
  • R. K. Bharti
Abstract

The ongoing exponential increase of genomic data, together with full diploid human genomes, creates new challenges not only for understanding genomic structure, function and development, but also for the storage, navigation and privacy of genomic data. In this paper, we propose a modified Differential Direct Coding algorithm. It is a general-purpose nucleotide compression algorithm based on a variable-length LUT. The method identifies repeat regions in an individual sequence and stores them in the lookup table (LUT). The algorithm compresses both repeat and non-repeat sequences. It also handles non-base characters and can compress any nucleotide sequence. It gives better results than the existing algorithm. The Differential Direct Coding algorithm was a fixed-size lookup table algorithm, i.e. it used a table of fixed size containing the 64 possible triplets formed from the four characters A, G, T and C. Because the available ASCII characters were not fully utilized, we make this table variable in length by adding further combinations whose sizes are multiples of a triplet, i.e. 6, 9, 12, and so on. Our algorithm is based on longest common substitution (LCS): it searches for the longest common sequence whose length is a multiple of 3 and then substitutes an ASCII value for that sequence to generate the variable-length LUT. The compression ratio obtained by the previous algorithms was smaller than that of the variable-length LUT compression algorithm, a difference that becomes substantial when the algorithm is applied to large genomic repositories. In addition, our algorithm uses the maximum number of available ASCII characters, thus increasing efficiency.

BACKGROUND

Storing the biological data of living organisms requires the use of a database management system.
The basic need is to warehouse these data, which consist of very long sequences held in databases. GenBank is one of the largest databases of biological sequences, and its size roughly doubles every 18 months [1]. Although storage space is not currently considered a major problem, since high-capacity storage devices are readily available at low cost, compressing biological data is still of great concern because of factors such as fast searching and retrieval of the data and performing operations on them. There are many methods of compressing data, e.g. the pointer method, the table method, etc. [3, 4, 5, 6, 7]. In this research we concentrate on the table method. Many algorithms are based on the dictionary method, such as Lempel-Ziv, and there are also statistical encoding algorithms such as the Huffman algorithm. However, these universal text compression algorithms are not well suited to biological sequences because they treat the sequence as a pure text stream. DNA sequences use only four symbols, representing the four nucleotide bases {A, C, T, G}. These four symbols can be modeled as {00, 01, 10, 11} respectively, so that each nucleotide base, which occupies 8 bits as a text character, occupies only 2 bits in this binary form. This would be one of the most efficient encoding schemes if the sequence contained no symbols other than the base characters A, G, T and C. The encoding itself can still be carried out, but the main problem arises during decompression, because the binary code of an unexpected symbol such as N or S will necessarily coincide with the binary code of one of A, G, T and C.
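The 2-bit scheme described above can be sketched as follows. This is a minimal illustration rather than code from the paper: the function names `pack` and `unpack` are hypothetical, and the left-aligned padding of a short final byte is an assumption made here. The code assignment follows the {A, C, T, G} to {00, 01, 10, 11} mapping given in the text.

```python
# Illustrative 2-bit packing; the mapping follows the text,
# everything else (names, padding rule) is an assumption of this sketch.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq: str) -> bytes:
    """Pack a sequence of A/C/T/G into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            # A symbol such as N or S has no 2-bit code: KeyError is raised.
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover `length` bases; the length must be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

All four 2-bit patterns are already taken by the bases, so the sketch can only reject a symbol such as N. Forcing it into one of the four codes would make it indistinguishable from a base on decompression, which is the problem the paragraph above describes.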
Another type of algorithm used for DNA compression is the Differential Direct (2D) Coding algorithm [2], which overcomes this problem by differentiating between the base characters and the unexpected symbols. The 2D coding algorithm replaces each group of three characters (a triplet) with a single character.

Shortcomings of 2D Algorithm

a) As discussed, the algorithm stores all possible combinations of {A, G, T, C, U}, although some of these combinations are not acceptable and are never used, which leaves some of the non-printable ASCII characters unused.

Govind Prasad Arya et al. / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 3 (3), 2012, 4411-4416
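The triplet substitution described in this section, and the way spare codes might hold longer repeats, can be sketched roughly as below. This is not the authors' implementation. The byte ranges (128..191 for the 64 triplets, 192..255 for extra entries), the greedy longest-match rule, and the names `build_lut`, `encode` and `decode` are all assumptions chosen for illustration. The sketch uses the four-base alphabet from the abstract and omits U, and it takes the list of repeats as input rather than discovering them.

```python
from itertools import product

BASES = "ACGT"


def build_lut(extra_repeats=()):
    """Map the 64 base triplets to codes 128..191 (assumed range) and
    optional longer repeats to the spare codes 192..255 (assumed range)."""
    lut = {"".join(t): 128 + i for i, t in enumerate(product(BASES, repeat=3))}
    if len(extra_repeats) > 64:
        raise ValueError("only 64 spare codes are available in this sketch")
    for j, rep in enumerate(extra_repeats):
        if len(rep) <= 3 or len(rep) % 3 or set(rep) - set(BASES):
            raise ValueError(f"expected bases with length 6, 9, 12, ...: {rep!r}")
        lut[rep] = 192 + j
    return lut


def encode(seq, lut):
    """Greedy encoding: try the longest table entries first."""
    lengths = sorted({len(key) for key in lut}, reverse=True)
    out = bytearray()
    i = 0
    while i < len(seq):
        for n in lengths:
            chunk = seq[i:i + n]
            code = lut.get(chunk)
            if code is not None:
                out.append(code)
                i += len(chunk)
                break
        else:
            # Non-base symbols (N, S, ...) pass through as plain 7-bit ASCII,
            # so they can never collide with a table code (>= 128).
            if ord(seq[i]) > 127:
                raise ValueError("pass-through characters must be 7-bit ASCII")
            out.append(ord(seq[i]))
            i += 1
    return bytes(out)


def decode(data, lut):
    """Invert encode(): table codes expand, other bytes are literal."""
    reverse = {code: text for text, code in lut.items()}
    return "".join(reverse.get(b, chr(b)) for b in data)
```

With only the fixed table, `encode("ACGACGNTTT", build_lut())` yields four bytes. Adding `"ACGACG"` as an extra entry reduces this to three, and `decode` restores the original string in both cases, including the non-base `N`.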

Similar resources

Optimal look-up table-based data hiding

In this study, the authors present a novel data hiding scheme using the minimum distortion look-up table (LUT) embedding that achieves good distortion-robustness performance. LUT-based data hiding is a simple and efficient way to embed information into multimedia content for various applications, such as transaction tracking and database annotation. The authors find it possible to optimally red...

Parallel Huffman Decoder with an Optimize Look UP

Compression is very important for systems with limited channel bandwidth and/or limited storage size. One of the main components in image/video compression is variable length coding (VLC). This paper discusses one of the most popular VLCs, known as Huffman coding. In our present work, a real time hardware parallel Huffman decoder has been successfully design...

A Tables Look-up Algorithm based on Program Code for CAVLC Decoding

Aiming to solve the problem of high memory access and long table look-up time in table look-up of CAVLC (Context-based Adaptive Variable Length Coding) for H.264/AVC, an efficient look-up algorithm based on program code is presented in this paper for CAVLC decoding, based on an analysis of the structure of the CAVLC code table. The basic idea of this algorithm is that a method b...

A low power variable length decoder for MPEG-2 based on nonuniform fine-grain table partitioning

Variable length coding is a widely used technique in digital video compression systems. Previous work related to variable length decoders (VLD’s) are primarily aimed at high throughput applications, but the increased demand for portable multimedia systems has made power a very important factor. In this paper, a data-driven variable length decoding architecture is presented, which exploits the s...

A variable bit rate video rate control algorithm at the group-of-pictures level for the H.265 compression standard

A rate control algorithm at the group of picture (GOP) level is proposed in this paper for variable bit rate applications of the H.265/HEVC video coding standard with buffer constraint. Due to structural changes in the HEVC compared to the previous standards, new rate control algorithms are needed to be designed. In the proposed algorithm, quantization parameter (QP) of each GOP is obtained by ...

Publication date: 2012